Abstract



Supervised topic models utilize documents' side information to discover predictive
low-dimensional representations of documents. Existing models rely on likelihood-based
estimation. In this paper, we present a general framework of max-margin supervised topic
models for both continuous and categorical response variables. Our approach, the maximum
entropy discrimination latent Dirichlet allocation (MedLDA), utilizes the max-margin
principle to train supervised topic models and estimate predictive topic representations
that are arguably more suitable for prediction tasks. The general principle of MedLDA
can be applied to perform joint max-margin learning and maximum likelihood estimation
for arbitrary topic models, directed or undirected, and supervised or unsupervised, when
supervised side information is available. We develop efficient variational methods for
posterior inference and parameter estimation, and demonstrate qualitatively and
quantitatively the advantages of MedLDA over likelihood-based topic models on movie review
and 20 Newsgroups data sets.

Keywords: Topic models, Maximum entropy discrimination latent Dirichlet allocation,
Max-margin learning.



1. Introduction 

Latent topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) have
recently gained much popularity for managing large collections of documents by discovering
a low-dimensional representation that captures the latent semantics of the collection. LDA
posits that each document is an admixture of latent topics, where each topic is represented
as a unigram distribution over a given vocabulary. The document-specific admixture
proportion is distributed as a latent Dirichlet random variable and provides a low-dimensional
representation of the document. This low-dimensional representation can be used for tasks
like classification and clustering, or merely as a tool to structurally browse the otherwise
unstructured collection.



The traditional LDA (Blei et al., 2003) is an unsupervised model, and thus is incapable
of incorporating the useful side information associated with corpora, which is not uncommon.
For example, online users usually post their reviews for products or restaurants with a rating
score or pros/cons rating; webpages can have their category labels; and the images in the
LabelMe (Russell et al., 2008) dataset are organized in different categories and each image
is associated with a set of annotation tags. Incorporating such supervised side information
may guide the topic models towards discovering secondary or non-dominant statistical
patterns (Chechik and Tishby, 2002), which may be more interesting or relevant to the users'
goals (e.g., predicting on unlabeled data). In contrast, the unsupervised LDA ignores such
supervised information and may yield more prominent and perhaps orthogonal (to the
users' goals) latent semantic structures. This problem is serious when dealing with complex
data, which usually have multiple, alternative, and conflicting underlying structures.
Therefore, in order to better extract the relevant or interesting underlying structures of
corpora, the supervised side information should be incorporated.

Recently, learning latent topic models with side information has gained increasing
attention. Major instances include the supervised topic models (sLDA) (Blei and McAuliffe,
2007) for regression,^1 multi-class LDA (an sLDA classification model) (Wang et al., 2009),
and the discriminative LDA (DiscLDA) (Lacoste-Julien et al., 2008) classification model.
All these models focus on document-level supervised information, such as document
categories or review rating scores. Other variants of supervised topic models have been
designed to deal with different application problems, such as the aspect rating model (Titov
and McDonald, 2008) and the credit attribution model (Ramage et al., 2009), of which the
former predicts ratings for each aspect and the latter associates each word with a label. In
this paper, without loss of generality, we focus on incorporating document-level supervision
information. Our learning principle can be generalized to arbitrary topic models. For the
document-level models, although sLDA and DiscLDA share the same goal (uncovering the
latent structure in a document collection while retaining predictive power for supervised
tasks), they differ in their training procedures. sLDA is trained by maximizing the joint
likelihood of data and response variables, while DiscLDA is trained to maximize the
conditional likelihood of response variables. Furthermore, to the best of our knowledge, almost
all existing supervised topic models are trained by maximizing the data likelihood.

In this paper, we propose a general principle for learning max-margin discriminative
supervised latent topic models for both regression and classification. In contrast to the
two-stage procedure of using topic models for prediction tasks (i.e., first discovering latent
topics and then feeding them to downstream prediction models), the proposed maximum entropy
discrimination latent Dirichlet allocation (MedLDA) is an integration of max-margin
prediction models (e.g., support vector machines for classification) and hierarchical Bayesian
topic models by optimizing a single objective function with a set of expected margin
constraints. MedLDA is a special instance of PoMEN (i.e., partially observed maximum entropy
discrimination Markov network) (Zhu et al., 2008b), which was proposed to combine max-margin
learning and structured hidden variables in undirected Markov networks, for discovering
latent topic representations of documents. In MedLDA, the parameters for the regression or
classification model are learned in a max-margin sense, and the discovery of latent topics
is coupled with the max-margin estimation of the model parameters. This interplay yields
latent topic representations that are more discriminative and more suitable for supervised
prediction tasks.

1. Although integrating sLDA with a generalized linear model was discussed in Blei and McAuliffe (2007),
no result was reported about the performance of sLDA when used for classification tasks. The
classification model was reported in a later paper (Wang et al., 2009).

The principle of MedLDA, joint max-margin learning and maximum likelihood
estimation, is extremely general and can be applied to arbitrary topic models, including
directed topic models (e.g., LDA and sLDA) or undirected Markov networks (e.g., the
Harmonium (Welling et al., 2004)), unsupervised (e.g., LDA and Harmonium) or supervised
(e.g., sLDA and hierarchical Harmonium (Yang et al., 2007)), and other variants of topic
models with different priors, such as correlated topic models (CTMs) (Blei and Lafferty,
2005). In this paper, we present several examples of applying the max-margin principle to
learn MedLDA models which use the unsupervised and supervised LDA as the underlying
topic models to discover latent topic representations of documents for both regression and
classification. We develop efficient and easy-to-implement variational methods for MedLDA;
for classification, its running time is comparable to that of an unsupervised LDA.
This property stems from the fact that the MedLDA classification model directly optimizes
the margin and does not suffer from a normalization factor, which generally makes learning
hard in fully generative models such as sLDA.

The paper is structured as follows. Section 2 introduces the basic concepts of latent topic 
models. Section 3 and Section 4 present the MedLDA models for regression and classification 
respectively, with efficient variational EM algorithms. Section 5 discusses the generalization 
of MedLDA to other latent variable topic models. Section 6 presents an empirical comparison
between MedLDA and likelihood-based topic models for both regression and classification.
Section 7 presents related work. Finally, Section 8 concludes this paper with future
research directions.



2. Unsupervised and Supervised Topic Models 

In this section, we review the basic concepts of unsupervised and supervised topic models 
and two variational upper bounds which will be used later. 



The unsupervised LDA (latent Dirichlet allocation) (Blei et al., 2003) is a hierarchical
Bayesian model, where topic proportions for a document are drawn from a Dirichlet
distribution and words in the document are repeatedly sampled from a topic which itself is
drawn from those topic proportions. Supervised topic models (sLDA) (Blei and McAuliffe,
2007) introduce a response variable to LDA for each document, as illustrated in Figure 1.

Let $K$ be the number of topics and $M$ be the number of terms in a vocabulary. $\beta$
denotes a $K \times M$ matrix and each $\beta_k$ is a distribution over the $M$ terms. For the regression
problem, where the response variable $y \in \mathbb{R}$, the generative process of sLDA is as follows:

1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.

2. For each word $n$:

(a) Draw a topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.

(b) Draw a word $w_n \mid z_n, \beta \sim \mathrm{Mult}(\beta_{z_n})$.

3. Draw a response variable: $y \mid z_{1:N}, \eta, \delta^2 \sim \mathcal{N}(\eta^\top \bar{z}, \delta^2)$, where $\bar{z} = \frac{1}{N} \sum_{n=1}^{N} z_n$ is the
average topic proportion of a document.

Figure 1: Supervised topic model (Blei and McAuliffe, 2007).
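The three generative steps can be illustrated with a short simulation. The following is only a minimal sketch in plain Python, not code from the paper: the function names, the toy values of `alpha`, `beta`, `eta`, and `delta2`, and the use of the standard `random` module are all assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def sample_categorical(probs, rng):
    """Draw one index from the categorical distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, eta, delta2, n_words, rng):
    """Sample one document and its response following steps 1-3."""
    K = len(alpha)
    theta = sample_dirichlet(alpha, rng)                          # step 1
    z = [sample_categorical(theta, rng) for _ in range(n_words)]  # step 2(a)
    w = [sample_categorical(beta[k], rng) for k in z]             # step 2(b)
    # z_bar: empirical frequency of each topic among the word assignments
    z_bar = [z.count(k) / n_words for k in range(K)]
    mean = sum(e * zb for e, zb in zip(eta, z_bar))
    y = rng.gauss(mean, delta2 ** 0.5)                            # step 3
    return theta, z, w, z_bar, y

# Hypothetical toy setting: K = 3 topics, M = 4 vocabulary terms.
rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]
beta = [[0.7, 0.1, 0.1, 0.1],
        [0.1, 0.7, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4]]
eta = [-1.0, 0.0, 2.0]
theta, z, w, z_bar, y = generate_document(alpha, beta, eta, 0.25, 20, rng)
```

Dropping the last step (the draw of `y`) gives the generative process of the unsupervised LDA.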

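The model defines a joint distribution:

$$p(\theta, z, y, W \mid \alpha, \beta, \eta, \delta^2) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \Big( \prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \Big)\, p(y_d \mid \eta^\top \bar{z}_d, \delta^2),$$

where $y$ is the vector of response variables in a corpus $\mathcal{D}$ and $W$ are all the words. The joint
likelihood on $\mathcal{D}$ is $p(y, W \mid \alpha, \beta, \eta, \delta^2)$. To estimate the unknown parameters $(\alpha, \beta, \eta, \delta^2)$,
sLDA maximizes the log-likelihood $\log p(y, W \mid \alpha, \beta, \eta, \delta^2)$. Given a new document, the
expected response value is the prediction:

$$\hat{y} = E[Y \mid w_{1:N}, \alpha, \beta, \eta, \delta^2] = \eta^\top E[\bar{Z} \mid w_{1:N}, \alpha, \beta, \delta^2], \qquad (1)$$

where $E[X]$ is an expectation with respect to the posterior distribution of the random
variable $X$.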

Since exact inference of the posterior distribution of hidden variables and the likelihood
is intractable, variational methods (Jordan et al., 1999) are applied to get approximate
solutions. Let $q(\theta, z \mid \gamma, \phi)$ be a variational distribution that approximates the posterior
$p(\theta, z \mid \alpha, \beta, \eta, \delta^2, y, W)$. By using Jensen's inequality, we can get a variational upper bound
of the negative log-likelihood:

$$\mathcal{L}^s(q) = -E_q[\log p(\theta, z, y, W \mid \alpha, \beta, \eta, \delta^2)] - \mathcal{H}(q(z, \theta)) \ge -\log p(y, W \mid \alpha, \beta, \eta, \delta^2),$$

where $\mathcal{H}(q) = -E_q[\log q]$ is the entropy of $q$. By introducing some independence
assumptions (like mean field) about the $q$ distribution, this upper bound can be efficiently
optimized, and we can estimate the parameters $(\alpha, \beta, \eta, \delta^2)$ and get the best approximation $q$.
See (Blei and McAuliffe, 2007) for more details.
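For readers who want the intermediate step, the bound follows from a standard application of Jensen's inequality to the concave logarithm. The derivation below is a sketch in the notation of this section; the shorthand $\Theta = (\alpha, \beta, \eta, \delta^2)$ is introduced here only for brevity and is not used elsewhere.

```latex
\begin{aligned}
\log p(y, W \mid \Theta)
  &= \log \sum_{z} \int q(\theta, z)\,
     \frac{p(\theta, z, y, W \mid \Theta)}{q(\theta, z)} \, d\theta \\
  &\ge E_q[\log p(\theta, z, y, W \mid \Theta)] - E_q[\log q(\theta, z)] \\
  &= E_q[\log p(\theta, z, y, W \mid \Theta)] + \mathcal{H}(q(z, \theta))
   \;=\; -\mathcal{L}^s(q).
\end{aligned}
```

Negating both sides gives $\mathcal{L}^s(q) \ge -\log p(y, W \mid \Theta)$. The gap is the KL divergence between $q$ and the true posterior, which is why minimizing the bound over $q$ also yields the best approximation within the chosen family.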



For the unsupervised LDA, the generative procedure is similar, but without the third
step. The joint distribution is $p(\theta, z, W \mid \alpha, \beta) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \big( \prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \big)$
and the likelihood is $p(W \mid \alpha, \beta)$. Similarly, a variational upper bound can be derived for
approximate inference:

$$\mathcal{L}^u(q) = -E_q[\log p(\theta, z, W \mid \alpha, \beta)] - \mathcal{H}(q(z, \theta)) \ge -\log p(W \mid \alpha, \beta),$$

where $q(\theta, z)$ is a variational distribution that approximates the posterior $p(\theta, z \mid \alpha, \beta, W)$.
Again, by making some independence assumptions, parameter estimation and posterior
inference can be efficiently done by optimizing $\mathcal{L}^u(q)$. See (Blei et al., 2003) for more
details.

In sLDA, by changing the distribution model of generating response variables, other
types of responses can be modeled, such as the discrete classification problem (Blei and
McAuliffe, 2007; Wang et al., 2009). However, posterior inference and parameter
estimation in the supervised LDA classification model are much more difficult than those of the
sLDA regression model because of the normalization factor of the non-Gaussian
distribution model for response variables. Variational methods or multi-delta methods were used to
approximate the normalization factor (Wang et al., 2009; Blei and McAuliffe, 2007).
DiscLDA (Lacoste-Julien et al., 2008) is a discriminative variant of supervised topic models
for classification, where the unknown parameters (i.e., a linear transformation matrix) are
learned by maximizing the conditional likelihood of the response variables.

Although both maximum likelihood estimation (MLE) and maximum conditional likelihood
estimation (MCLE) have shown great success in many cases, max-margin learning
is arguably more discriminative and closer to our final prediction task in supervised topic
models. Empirically, max-margin methods such as support vector machines (SVMs) for
classification have demonstrated impressive success in a wide range of tasks, including image
classification and character recognition. In addition to this empirical success, max-margin
methods enjoy strong generalization guarantees and are able to use kernels, allowing the
classifier to deal with a very high-dimensional feature space.

To integrate the advantages of max-margin methods into the procedure of discovering 
latent topics, below, we present a max-margin variant of the supervised topic models, which 
can discover predictive topic representations that are more suitable for supervised prediction 
tasks, e.g., regression and classification. 



3. Maximum Entropy Discrimination LDA for Regression 

In this section, we consider the supervised prediction task, where the response variables 
take continuous real values. This is known as a regression problem in machine learning. 
We present two MedLDA regression models that perform max-margin learning for the su- 
pervised LDA and unsupervised LDA models. Before diving into the full exposition of our 
methods, we first review the basic support vector regression method, upon which MedLDA 
is built. 



3.1 Support Vector Regression 

Support vector machines have been developed for both classification and regression. In this
section, we consider support vector regression (SVR), on which a comprehensive tutorial
has been published by Smola and Schölkopf (2003). Here, we provide a brief recap of the
basic concepts.

Suppose we are given a training set $\mathcal{D} = \{(x_1, y_1), \dots, (x_D, y_D)\}$, where $x \in \mathcal{X}$ are inputs
and $y \in \mathbb{R}$ are real response values. In $\epsilon$-support vector regression (Vapnik, 1995), our goal
is to find a function $h(x) \in \mathcal{F}$ that has at most $\epsilon$ deviation from the true response values $y$
for all the training data, and at the same time is as flat as possible. One common choice of the






Zhu, Ahmed, and Xing 



function family $\mathcal{F}$ is the linear functions, that is, $h(x) = \eta^\top f(x)$, where $f = \{f_1, \dots, f_K\}$ is
a vector of feature functions. Each $f_k : \mathcal{X} \to \mathbb{R}$ is a feature function, and $\eta$ is the corresponding
weight vector. Formally, the linear SVR finds an optimal linear function by solving the
following constrained convex optimization problem:

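$$\text{P0(SVR)}: \quad \min_{\eta, \xi, \xi^*} \ \frac{1}{2} \|\eta\|_2^2 + C \sum_{d=1}^{D} (\xi_d + \xi_d^*)$$

$$\text{s.t.} \ \forall d: \ \begin{cases} y_d - \eta^\top f(x_d) \le \epsilon + \xi_d \\ -y_d + \eta^\top f(x_d) \le \epsilon + \xi_d^* \\ \xi_d, \xi_d^* \ge 0, \end{cases}$$

where $\|\eta\|_2^2 = \eta^\top \eta$ is the squared $\ell_2$-norm; $\xi$ and $\xi^*$ are slack variables that tolerate some errors
in the training data; and $\epsilon$ is the precision parameter. The positive regularization constant
$C$ determines the trade-off between the flatness of $h$ (represented by the $\ell_2$-norm) and
the amount up to which deviations larger than $\epsilon$ are tolerated. The problem P0 can be
equivalently formulated as a regularized empirical loss minimization, where the loss is the
so-called $\epsilon$-insensitive loss (Smola and Schölkopf, 2003).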
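The equivalence with a regularized loss can be made concrete: at the optimum, the slack terms satisfy $\xi_d + \xi_d^* = \max(0, |y_d - \eta^\top f(x_d)| - \epsilon)$. The snippet below is a minimal sketch of that unconstrained objective in plain Python; the function names and the toy inputs are hypothetical and serve only to illustrate the formula.

```python
def eps_insensitive_loss(y, y_hat, eps):
    """Zero inside the eps-tube around y, linear outside it."""
    return max(0.0, abs(y - y_hat) - eps)

def svr_objective(eta, features, targets, C, eps):
    """0.5 * ||eta||^2 + C * sum of eps-insensitive losses (linear model)."""
    reg = 0.5 * sum(e * e for e in eta)
    loss = 0.0
    for f, y in zip(features, targets):
        y_hat = sum(e * x for e, x in zip(eta, f))  # h(x) = eta^T f(x)
        loss += eps_insensitive_loss(y, y_hat, eps)
    return reg + C * loss

# Toy example: the first point lies inside the tube, the second outside.
value = svr_objective([1.0, 0.0], [[1.0, 2.0], [3.0, 0.0]], [1.0, 2.0],
                      C=2.0, eps=0.5)
```

Minimizing this function over `eta` gives the same solution as P0, since the slack variables in P0 simply take the value of the loss at each training point.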



For the standard SVR optimization problem, P0 is a QP problem and can be easily
solved in the dual formulation. In the Lagrangian method, samples with non-zero Lagrange
multipliers are called support vectors, the same as in the SVM classification model. There are
also some freely available packages for solving a standard SVR problem, such as
SVM-light (Joachims, 1999). We will use these methods as a sub-routine to solve our proposed
approach.

3.2 Learning MedLDA for Regression 

Instead of learning a point estimate of $\eta$ as in sLDA, we take a more general^2 Bayesian-style
(i.e., an averaging model) approach and learn a distribution^3 $q(\eta)$ in a max-margin manner.
For prediction, we take the average over all the possible models (represented by $\eta$):

$$\hat{y} = E[Y \mid w_{1:N}, \alpha, \beta, \delta^2] = E[\eta^\top \bar{Z} \mid w_{1:N}, \alpha, \beta, \delta^2]. \qquad (2)$$

Now, the question underlying the averaging prediction rule (2) is how we can devise
an appropriate loss function and constraints to integrate the max-margin concepts of SVR
into latent topic discovery. In the sequel, we present the maximum entropy discrimination
latent Dirichlet allocation (MedLDA), which is an extension of the PoMEN (i.e., partially
observed maximum entropy discrimination Markov networks) (Zhu et al., 2008b) framework.
PoMEN is an elegant combination of max-margin learning with structured hidden variables
in Markov networks. MedLDA is an extension of PoMEN to learn directed Bayesian
networks with latent variables, in particular the latent topic models, which discover latent
semantic structures of document collections.



2. Under the special case of linear models, the posterior mean of an averaging model can be directly solved
in the same manner as a point estimate.

3. In principle, we can perform Bayesian-style estimation for other parameters, like $\delta^2$. For simplicity, we
only consider $\eta$ as a random variable in this paper.

There are two principled choice points in MedLDA according to the prediction rule
(2): (1) the distribution of the model parameter $\eta$; and (2) the distribution of the latent topic
assignment $Z$. Below, we present two MedLDA regression models by using supervised LDA
or unsupervised LDA to discover the latent topic assignment $Z$. Accordingly, we denote
these two models as $\mathrm{MedLDA}^r_{full}$ and $\mathrm{MedLDA}^r_{partial}$.

3.2.1 Max-Margin Training of sLDA 

For regression, MedLDA is defined as an integration of a Bayesian sLDA, where the
parameter $\eta$ is sampled from a prior $p_0(\eta)$, and the $\epsilon$-insensitive support vector
regression (SVR) (Smola and Schölkopf, 2003). Thus, MedLDA defines a joint distribution:
$p(\theta, z, \eta, y, W \mid \alpha, \beta, \delta^2) = p_0(\eta)\, p(\theta, z, y, W \mid \alpha, \beta, \eta, \delta^2)$, where the second term is the same
as in sLDA. Since directly optimizing the log-likelihood is intractable, as in sLDA, we
optimize its upper bound. Unlike in sLDA, $\eta$ is now a random variable, so we define the
variational distribution $q(\theta, z, \eta \mid \gamma, \phi)$ to approximate the true posterior $p(\theta, z, \eta \mid \alpha, \beta, \delta^2, y, W)$.
Then, the upper bound of the negative log-likelihood $-\log p(y, W \mid \alpha, \beta, \delta^2)$ is

$$\mathcal{L}^{bs}(q) \triangleq -E_q[\log p(\theta, z, \eta, y, W \mid \alpha, \beta, \delta^2)] - \mathcal{H}(q(\theta, z, \eta)) = KL(q(\eta) \| p_0(\eta)) + E_{q(\eta)}[\mathcal{L}^s], \qquad (3)$$

where $KL(p \| q) = E_p[\log(p/q)]$ is the Kullback-Leibler (KL) divergence.
Thus, the integrated learning problem is defined as:

$$\text{P1}(\mathrm{MedLDA}^r_{full}): \quad \min_{q, \alpha, \beta, \delta^2, \xi, \xi^*} \ \mathcal{L}^{bs}(q) + C \sum_{d=1}^{D} (\xi_d + \xi_d^*)$$

$$\text{s.t.} \ \forall d: \ \begin{cases} y_d - E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d, & \mu_d \\ -y_d + E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d^*, & \mu_d^* \\ \xi_d \ge 0, & v_d \\ \xi_d^* \ge 0, & v_d^* \end{cases}$$

where $\mu, \mu^*, v, v^*$ are Lagrange multipliers; $\xi, \xi^*$ are slack variables absorbing errors in the
training data; and $\epsilon$ is the precision parameter. The constraints in P1 are in the same form as
those of P0, but in an expected version because both the latent topic assignments $Z$ and
the model parameters $\eta$ are random variables in MedLDA. As in SVR, the expected
constraints correspond to an $\epsilon$-insensitive loss: if the current prediction $\hat{y}$ as in Eq.
(2) does not deviate from the target value too much (i.e., by less than $\epsilon$), there is no loss;
otherwise, a linear penalty is incurred.

The rationale underlying $\mathrm{MedLDA}^r_{full}$ is as follows: let the current model be $p(\theta, z, \eta, y, W \mid \alpha, \beta, \delta^2)$;
then we want to find a latent topic representation and a model distribution (as represented
by the distribution $q$) which on one hand tend to predict correctly on the data with a
sufficiently large margin, and on the other hand tend to explain the data well (i.e., minimizing a
variational upper bound of the negative log-likelihood). The max-margin estimation and
topic discovery procedure are coupled together via the constraints, which are defined on the
expectations of model parameters $\eta$ and the latent topic representations $Z$. This interplay
will yield a topic representation that is more suitable for max-margin learning, as explained
below.

Variational EM-Algorithm: Solving the constrained problem P1 is generally
intractable. Thus, we make use of mean-field variational methods (Jordan et al., 1999) to
efficiently obtain an approximate $q$. The basic principle of mean-field variational methods
is to form a factorized distribution of the latent variables, parameterized by free variables
which are called variational parameters. These parameters are fit so that the KL
divergence between the approximate $q$ and the true posterior is small. Variational methods have
been successfully used in many topic models, as we have presented in Section 2.

As in standard topic models, we assume $q(\theta, z, \eta \mid \gamma, \phi) = q(\eta) \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{dn} \mid \phi_{dn})$,
where $\gamma_d$ is a $K$-dimensional vector of Dirichlet parameters and each $\phi_{dn}$ is a categorical
distribution over $K$ topics. Then, $E[Z_{dn}] = \phi_{dn}$ and $E[\eta^\top \bar{Z}_d] = E[\eta]^\top (1/N) \sum_{n=1}^{N} \phi_{dn}$. We
can develop an EM algorithm, which iteratively solves the following two steps: E-step: infer
the posterior distribution of the hidden variables $\theta$, $Z$, and $\eta$; and M-step: estimate the
unknown model parameters $\alpha$, $\beta$, and $\delta^2$.
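Under the factorized $q$, the expectations that enter the margin constraints reduce to simple averages of the variational parameters. The following is a minimal sketch in plain Python; the function names and the toy values of `phi_d` and `mean_eta` are assumptions made for illustration, with `mean_eta` standing for $E[\eta]$.

```python
def expected_topic_mean(phi_d):
    """E[Z_bar_d] = (1/N) * sum_n phi_dn for a single document."""
    n_words = len(phi_d)
    n_topics = len(phi_d[0])
    return [sum(phi_d[n][k] for n in range(n_words)) / n_words
            for k in range(n_topics)]

def expected_prediction(mean_eta, phi_d):
    """E[eta^T Z_bar_d] = E[eta]^T E[Z_bar_d] under the factorized q."""
    z_bar = expected_topic_mean(phi_d)
    return sum(m * z for m, z in zip(mean_eta, z_bar))

# Toy document with N = 3 words and K = 2 topics.
phi_d = [[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]]
mean_eta = [1.0, 3.0]
z_bar = expected_topic_mean(phi_d)
y_hat = expected_prediction(mean_eta, phi_d)
```

The product form holds because $q(\eta)$ and $q(z)$ are independent under the mean-field assumption, so the expectation of the inner product factorizes.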

The essential difference between MedLDA and sLDA lies in the E-step to infer the
posterior distribution of $z$ and $\eta$ because of the margin constraints in P1. As we shall see
in Eq. (5), these constraints will bias the expected topic proportions towards the ones that
are more suitable for the supervised prediction tasks. Since the constraints in P1 are not
on the model parameters ($\alpha$, $\beta$, and $\delta^2$), the M-step is similar to that of sLDA. We
outline the algorithm in Alg. 1 and explain it in detail below. Specifically, we formulate a
Lagrangian $L$ for P1:

$$\begin{aligned} L = {} & \mathcal{L}^{bs}(q) + C \sum_{d=1}^{D} (\xi_d + \xi_d^*) - \sum_{d=1}^{D} \mu_d \big( \epsilon + \xi_d - y_d + E[\eta^\top \bar{Z}_d] \big) \\ & - \sum_{d=1}^{D} \Big( \mu_d^* \big( \epsilon + \xi_d^* + y_d - E[\eta^\top \bar{Z}_d] \big) + v_d \xi_d + v_d^* \xi_d^* \Big) - \sum_{d=1}^{D} \sum_{i=1}^{N} c_{di} \Big( \sum_{j=1}^{K} \phi_{dij} - 1 \Big), \end{aligned}$$

where the last term is due to the normalization condition $\sum_{j=1}^{K} \phi_{dij} = 1, \ \forall i, d$. Then,
the EM procedure alternately optimizes the Lagrangian functional with respect to each
argument.

Algorithm 1 Variational $\mathrm{MedLDA}^r$

Input: corpus $\mathcal{D} = \{(y, W)\}$, constants $C$ and $\epsilon$, and topic number $K$.

Output: Dirichlet parameters $\gamma$, posterior distribution $q(\eta)$, parameters $\alpha$, $\beta$ and $\delta^2$.

repeat
    /**** E-Step ****/
    for d = 1 to D do
        Update $\gamma_d$ as in Eq. (4).
        for i = 1 to N do
            Update $\phi_{di}$ as in Eq. (5).
        end for
    end for
    Solve the dual problem D1 to get $q(\eta)$, $\mu$ and $\mu^*$.
    /**** M-Step ****/
    Update $\beta$ using Eq. (7), and update $\delta^2$ using Eq. (8). $\alpha$ is fixed as $1/K$ times the ones
    vector.
until convergence

1. E-step: we infer the posterior distribution of the latent variables $\theta$, $Z$ and $\eta$. For
the variables $\theta$ and $Z$, inferring the posterior distribution amounts to fitting the variational
parameters $\gamma$ and $\phi$ because of the mean-field assumption about $q$, but for $\eta$ the
optimization is on $q(\eta)$. Specifically, we have the following update rules for different
latent variables.

Since the constraints in P1 are not on $\gamma$, optimizing $L$ with respect to $\gamma_d$ gives
the same update formula as in sLDA:

$$\gamma_d \leftarrow \alpha + \sum_{n=1}^{N} \phi_{dn}. \qquad (4)$$

Due to the fully factorized assumption of $q$, for each document $d$ and each word $i$, by
setting $\partial L / \partial \phi_{di} = 0$, we have:

$$\phi_{di} \propto \exp\Big( E[\log \theta \mid \gamma_d] + E[\log p(w_{di} \mid \beta)] + \frac{y_d}{N \delta^2} E[\eta] - \frac{2 E[\eta^\top \phi_{d,-i}\, \eta] + E[\eta \circ \eta]}{2 N^2 \delta^2} + \frac{E[\eta]}{N} (\mu_d - \mu_d^*) \Big), \qquad (5)$$

where $\phi_{d,-i} = \sum_{n \ne i} \phi_{dn}$; $\eta \circ \eta$ is the element-wise product; and the result of
exponentiating a vector is a vector of the exponentials of its corresponding components. The
first two terms in the exponential are the same as those in unsupervised LDA.

The essential differences of $\mathrm{MedLDA}^r$ from sLDA lie in the last three terms in the
exponential of $\phi_{di}$. Firstly, the third and fourth terms are similar to those of sLDA,
but in an expected version since we are learning the distribution $q(\eta)$. The
second-order expectations $E[\eta^\top \phi_{d,-i}\, \eta]$ and $E[\eta \circ \eta]$ mean that the covariances of $\eta$ affect the
distribution over topics. This makes our approach significantly different from a point
estimation method, like sLDA, where no expectations or covariances are involved in
updating $\phi_{di}$. Secondly, the last term is from the max-margin regression formulation.
For a document $d$ which lies around the decision boundary, i.e., a support vector,
either $\mu_d$ or $\mu_d^*$ is non-zero, and the last term biases $\phi_{di}$ towards a distribution that
favors a more accurate prediction on the document. Moreover, the last term is fixed
for words in the document and thus will directly affect the latent representation of the
document, i.e., $\gamma_d$. Therefore, the latent representation by $\mathrm{MedLDA}^r$ is more suitable
for max-margin learning.

Let $A$ be the $D \times K$ matrix whose rows are the vectors $\bar{Z}_d^\top$. Then, we have the
following theorem.



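Theorem 1 For MedLDA, the optimum solution of $q(\eta)$ has the form:

$$q(\eta) = \frac{p_0(\eta)}{Z} \exp\Big( \eta^\top \sum_{d=1}^{D} \big( \mu_d - \mu_d^* + \frac{y_d}{\delta^2} \big) E[\bar{Z}_d] - \eta^\top \frac{E[A^\top A]}{2 \delta^2} \eta \Big),$$

where $E[A^\top A] = \sum_{d=1}^{D} E[\bar{Z}_d \bar{Z}_d^\top]$, and $E[\bar{Z}_d \bar{Z}_d^\top] = \frac{1}{N^2} \big( \sum_{n=1}^{N} \sum_{m \ne n} \phi_{dn} \phi_{dm}^\top + \sum_{n=1}^{N} \mathrm{diag}\{\phi_{dn}\} \big)$.
The Lagrange multipliers are the solution of the dual problem of P1:

$$\text{D1}: \quad \max_{\mu, \mu^*} \ -\log Z - \epsilon \sum_{d=1}^{D} (\mu_d + \mu_d^*) + \sum_{d=1}^{D} y_d (\mu_d - \mu_d^*)$$

$$\text{s.t.} \ \forall d: \ \mu_d, \mu_d^* \in [0, C].$$

Proof (sketch) Setting the partial derivative $\partial L / \partial q(\eta)$ equal to zero gives the solution
of $q(\eta)$. Plugging $q(\eta)$ into $L$, we get the dual problem. ■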
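The second moment $E[\bar{Z}_d \bar{Z}_d^\top]$ appearing in Theorem 1 can be computed without an explicit double loop over word pairs, by forming the outer product of the summed $\phi_{dn}$ and correcting the $n = m$ terms. The snippet below is a minimal sketch in plain Python; the function name and the toy `phi_d` are hypothetical.

```python
def expected_outer(phi_d):
    """E[Z_bar Z_bar^T] = (1/N^2) (sum_{n != m} phi_n phi_m^T + sum_n diag(phi_n))."""
    n_words = len(phi_d)
    K = len(phi_d[0])
    s = [sum(phi_d[n][k] for n in range(n_words)) for k in range(K)]
    # Outer product of the sums covers every ordered pair (n, m), including n == m.
    M = [[s[j] * s[k] for k in range(K)] for j in range(K)]
    # Remove the n == m pairs, which are not independent draws.
    for phi in phi_d:
        for j in range(K):
            for k in range(K):
                M[j][k] -= phi[j] * phi[k]
    # For n == m, E[z_n z_n^T] is diag(phi_n) because z_n is a one-hot vector.
    for k in range(K):
        M[k][k] += s[k]
    return [[M[j][k] / n_words ** 2 for k in range(K)] for j in range(K)]

phi_d = [[0.8, 0.2], [0.5, 0.5], [0.2, 0.8]]
second_moment = expected_outer(phi_d)
```

A quick sanity check is that the entries of the result sum to one, since each $\bar{Z}_d$ has entries summing to one. Summing these matrices over documents gives $E[A^\top A]$.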



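In $\mathrm{MedLDA}^r$, we can choose different priors to introduce some regularization effects.
For the standard normal prior: $p_0(\eta) = \mathcal{N}(0, I)$, we have the corollary:

Corollary 2 Assume the prior $p_0(\eta) = \mathcal{N}(0, I)$, then the optimum solution of $q(\eta)$ is

$$q(\eta) = \mathcal{N}(\lambda, \Sigma), \qquad (6)$$

where $\lambda = \Sigma \big( \sum_{d=1}^{D} (\mu_d - \mu_d^* + \frac{y_d}{\delta^2}) E[\bar{Z}_d] \big)$ is the mean and $\Sigma = (I + \frac{1}{\delta^2} E[A^\top A])^{-1}$ is
a $K \times K$ covariance matrix. The dual problem of P1 is:

$$\max_{\mu, \mu^*} \ -\frac{1}{2} a^\top \Sigma a - \epsilon \sum_{d=1}^{D} (\mu_d + \mu_d^*) + \sum_{d=1}^{D} y_d (\mu_d - \mu_d^*)$$

$$\text{s.t.} \ \forall d: \ \mu_d, \mu_d^* \in [0, C],$$

where $a = \sum_{d=1}^{D} (\mu_d - \mu_d^* + \frac{y_d}{\delta^2}) E[\bar{Z}_d]$.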
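The step from Theorem 1 to Corollary 2 is a completion of the square. The derivation below is a sketch under the stated normal prior, using the same symbols $a$, $\Sigma$, and $Z$ as above.

```latex
\begin{aligned}
p_0(\eta) \exp\!\Big( \eta^\top a - \eta^\top \tfrac{E[A^\top A]}{2\delta^2} \eta \Big)
  &\propto \exp\!\Big( -\tfrac{1}{2} \eta^\top \big( I + \tfrac{1}{\delta^2} E[A^\top A] \big) \eta
     + \eta^\top a \Big) \\
  &= \exp\!\Big( -\tfrac{1}{2} (\eta - \Sigma a)^\top \Sigma^{-1} (\eta - \Sigma a)
     + \tfrac{1}{2} a^\top \Sigma a \Big),
\end{aligned}
\qquad \Sigma^{-1} = I + \tfrac{1}{\delta^2} E[A^\top A].
```

Hence $q(\eta)$ is Gaussian with mean $\lambda = \Sigma a$ and covariance $\Sigma$, and $\log Z = \tfrac{1}{2} a^\top \Sigma a + \tfrac{1}{2} \log |\Sigma|$. The second term does not depend on $\mu$ or $\mu^*$, so $-\log Z$ in D1 reduces to $-\tfrac{1}{2} a^\top \Sigma a$ up to a constant.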

In the above Corollary, computation of $\Sigma$ can be achieved robustly through Cholesky
decomposition of $\delta^2 I + E[A^\top A]$, an $O(K^3)$ procedure. Another example is the Laplace
prior, which can lead to a shrinkage effect (Zhu et al., 2008a) that is useful in sparse
problems. In this paper, we focus on the normal prior; extension to the Laplace
prior can be done similarly as in (Zhu et al., 2008a). For the standard normal prior,
the dual optimization problem is a QP problem and can be solved with any standard
QP solver, although such solvers may not be very efficient. To leverage recent developments
in support vector regression, we first prove the following corollary:

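Corollary 3 Assume the prior $p_0(\eta) = \mathcal{N}(0, I)$, then the mean $\lambda$ of $q(\eta)$ is the
optimum solution of the following problem:

$$\min_{\lambda, \xi, \xi^*} \ \frac{1}{2} \lambda^\top \Sigma^{-1} \lambda - \lambda^\top \Big( \sum_{d=1}^{D} \frac{y_d}{\delta^2} E[\bar{Z}_d] \Big) + C \sum_{d=1}^{D} (\xi_d + \xi_d^*)$$

$$\text{s.t.} \ \forall d: \ \begin{cases} y_d - \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d \\ -y_d + \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d^* \\ \xi_d, \xi_d^* \ge 0 \end{cases}$$

Proof See Appendix A for details. ■

The above primal form can be re-formulated as a standard SVR problem and solved by
using existing algorithms like SVM-light (Joachims, 1999) to get $\lambda$ and the dual
parameters $\mu$ and $\mu^*$. Specifically, we do Cholesky decomposition $\Sigma^{-1} = U^\top U$, where $U$ is an
upper triangular matrix with strictly positive diagonal entries. Let $\nu = \sum_{d=1}^{D} \frac{y_d}{\delta^2} E[\bar{Z}_d]$,
and we define $\lambda' = U(\lambda - \Sigma \nu)$; $y'_d = y_d - \nu^\top \Sigma E[\bar{Z}_d]$; and $x_d = (U^{-1})^\top E[\bar{Z}_d]$. Then, the
above primal problem in Corollary 3 can be re-formulated as the following standard
form:

$$\min_{\lambda', \xi, \xi^*} \ \frac{1}{2} \|\lambda'\|_2^2 + C \sum_{d=1}^{D} (\xi_d + \xi_d^*)$$

$$\text{s.t.} \ \forall d: \ \begin{cases} y'_d - (\lambda')^\top x_d \le \epsilon + \xi_d \\ -y'_d + (\lambda')^\top x_d \le \epsilon + \xi_d^* \\ \xi_d, \xi_d^* \ge 0 \end{cases}$$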


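2. M-step: Now, we estimate the unknown parameters $\alpha$, $\beta$, and $\delta^2$. Here, we assume
$\alpha$ is fixed. For $\beta$, the update equations are the same as for sLDA:

$$\beta_{k,w} \propto \sum_{d=1}^{D} \sum_{n=1}^{N} 1(w_{dn} = w)\, \phi_{dnk}. \qquad (7)$$

For $\delta^2$, this step is similar to that of sLDA but in an expected version. The update
rule is:

$$\delta^2 \leftarrow \frac{1}{D} \big( y^\top y - 2 y^\top E[A] E[\eta] + E[\eta^\top E[A^\top A] \eta] \big), \qquad (8)$$

where $E[\eta^\top E[A^\top A] \eta] = \mathrm{tr}(E[A^\top A] E[\eta \eta^\top])$.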
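The update in Eq. (7) is a soft count of how often each word is assigned to each topic, followed by row normalization. The snippet below is a minimal sketch in plain Python; the data layout (`docs[d][n]` as an integer word id, `phi[d][n][k]` as $\phi_{dnk}$), the function name, and the uniform fallback for an empty topic are assumptions made for illustration.

```python
def update_beta(docs, phi, n_topics, vocab_size):
    """beta[k][w] proportional to sum_d sum_n 1(w_dn = w) * phi[d][n][k]."""
    counts = [[0.0] * vocab_size for _ in range(n_topics)]
    for words, phi_d in zip(docs, phi):
        for w, phi_dn in zip(words, phi_d):
            for k in range(n_topics):
                counts[k][w] += phi_dn[k]
    beta = []
    for row in counts:
        total = sum(row)
        if total == 0.0:
            # Assumption: fall back to a uniform row if a topic received no mass.
            beta.append([1.0 / vocab_size] * vocab_size)
        else:
            beta.append([c / total for c in row])
    return beta

# Two toy documents over a vocabulary of 3 terms, with K = 2 topics.
docs = [[0, 1], [1, 2]]
phi = [[[1.0, 0.0], [0.0, 1.0]],
       [[0.5, 0.5], [1.0, 0.0]]]
beta = update_beta(docs, phi, n_topics=2, vocab_size=3)
```

In practice a small smoothing constant is often added to the counts to avoid zero probabilities; that choice is not specified in the text and is left out here.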
3.2.2 Max-Margin Learning of LDA for Regression 

In the previous section, we have presented the MedLDA regression model which uses the
supervised sLDA to discover the latent topic representations $Z$. The same principle can
be applied to perform joint maximum likelihood estimation and max-margin training for
the unsupervised LDA (Blei et al., 2003). In this section, we present this MedLDA model,
which will be referred to as $\mathrm{MedLDA}^r_{partial}$.

A naive approach to using the unsupervised LDA for supervised prediction tasks, e.g.,
regression, is a two-step procedure: (1) using the unsupervised LDA to discover the latent
topic representations of documents; and (2) feeding the low-dimensional topic
representations into a regression model (e.g., SVR) for training and testing. This de-coupled approach
is rather sub-optimal because the side information of documents (e.g., rating scores of movie
reviews) is not used in discovering the low-dimensional representations and thus can result
in a sub-optimal representation for prediction tasks. Below, we present $\mathrm{MedLDA}^r_{partial}$,
which integrates an unsupervised LDA for discovering topics with the SVR for regression.
The interplay between topic discovery and supervised prediction will result in more
discriminative latent topic representations, as in $\mathrm{MedLDA}^r_{full}$.

When the underlying topic model is the unsupervised LDA, the likelihood is $p(W \mid \alpha, \beta)$
as we have stated. For regression, we apply the $\epsilon$-insensitive support vector regression
(SVR) (Smola and Schölkopf, 2003) approach as before. Again, we learn a distribution $q(\eta)$.
The prediction rule is the same as in Eq. (2). The integrated learning problem is defined as:

$$\text{P2}(\mathrm{MedLDA}^r_{partial}): \quad \min_{q, q(\eta), \alpha, \beta, \xi, \xi^*} \ \mathcal{L}^u(q) + KL(q(\eta) \| p_0(\eta)) + C \sum_{d=1}^{D} (\xi_d + \xi_d^*)$$

$$\text{s.t.} \ \forall d: \ \begin{cases} y_d - E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d \\ -y_d + E[\eta^\top \bar{Z}_d] \le \epsilon + \xi_d^* \\ \xi_d, \xi_d^* \ge 0, \end{cases}$$

where the KL-divergence is a regularizer that biases the estimate of $q(\eta)$ towards the prior.
In $\mathrm{MedLDA}^r_{full}$, this KL-regularizer is implicitly contained in the variational bound $\mathcal{L}^{bs}$ as
shown in Eq. (3).

Variational EM-Algorithm: For $\mathrm{MedLDA}^r_{partial}$, the constrained optimization
problem P2 can be similarly solved with an EM procedure. Specifically, we make the same
independence assumptions about $q$ as in LDA (Blei et al., 2003), that is, we assume that
$q(\theta, z \mid \gamma, \phi) = \prod_{d=1}^{D} q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{dn} \mid \phi_{dn})$, where the variational parameters $\gamma$ and $\phi$ are
the same as in $\mathrm{MedLDA}^r_{full}$. By formulating a Lagrangian $L$ for P2 and iteratively
optimizing $L$ over each variable, we can get a variational EM-algorithm that is similar to that of
$\mathrm{MedLDA}^r_{full}$.

1. E-step: The update rule for $\gamma$ is the same as in $\mathrm{MedLDA}^r_{full}$. For $\phi$, by setting
$\partial L / \partial \phi_{di} = 0$, we have:

$$\phi_{di} \propto \exp\Big( E[\log \theta \mid \gamma_d] + E[\log p(w_{di} \mid \beta)] + \frac{E[\eta]}{N} (\mu_d - \mu_d^*) \Big). \qquad (9)$$

Compared to Eq. (5), Eq. (9) is simpler and does not have the complex third and
fourth terms of Eq. (5). This simplicity suggests that the latent topic representation
is less affected by the max-margin estimation (i.e., the prediction model's parameters).
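Eq. (9) is a softmax over topics with one extra margin-driven term. The snippet below is a minimal sketch in plain Python, assuming that $E[\log \theta \mid \gamma_d]$ has already been computed and is passed in as `e_log_theta`, and that $E[\log p(w_{di} \mid \beta)]$ is the vector of $\log \beta_{k, w_{di}}$ for a point estimate of $\beta$; all names and toy values are hypothetical.

```python
import math

def update_phi_partial(e_log_theta, log_beta_w, mean_eta, mu, mu_star, n_words):
    """phi_di from Eq. (9): LDA terms plus the max-margin bias E[eta]/N * (mu - mu*)."""
    K = len(e_log_theta)
    scores = [e_log_theta[k] + log_beta_w[k]
              + mean_eta[k] * (mu - mu_star) / n_words
              for k in range(K)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

e_log_theta = [0.0, 0.0]
log_beta_w = [math.log(0.5), math.log(0.5)]
mean_eta = [0.0, 2.0]
# Document that is not a support vector: both multipliers are zero.
phi_plain = update_phi_partial(e_log_theta, log_beta_w, mean_eta, 0.0, 0.0, 2)
# Support vector whose prediction is too low: mu > 0 shifts mass to high-eta topics.
phi_margin = update_phi_partial(e_log_theta, log_beta_w, mean_eta, 1.0, 0.0, 2)
```

When both multipliers are zero the update coincides with the unsupervised LDA update, which matches the observation that only support vectors influence the topic assignments through the margin term.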

Setting $\partial L / \partial q(\eta) = 0$, we get:

$$q(\eta) = \frac{p_0(\eta)}{Z} \exp\Big( \eta^\top \sum_{d=1}^{D} (\mu_d - \mu_d^*) E[\bar{Z}_d] \Big).$$

Plugging $q(\eta)$ into $L$, the dual problem D2 is the same as D1. Again, we can choose
different priors to introduce some regularization effects. For the standard normal
prior: $p_0(\eta) = \mathcal{N}(0, I)$, the posterior is also a normal: $q(\eta) = \mathcal{N}(\lambda, I)$, where $\lambda =
\sum_{d=1}^{D} (\mu_d - \mu_d^*) E[\bar{Z}_d]$ is the mean. This identity covariance matrix is much simpler
than the covariance matrix $\Sigma$ in $\mathrm{MedLDA}^r_{full}$, which depends on the latent topic
representation $Z$. Since $I$ is independent of $Z$, the prediction model in $\mathrm{MedLDA}^r_{partial}$
is less affected by the latent topic representations. Together with the simpler update
rule (9), we can conclude that the coupling between the max-margin estimation and
the discovery of latent topic representations in $\mathrm{MedLDA}^r_{partial}$ is looser than that of
$\mathrm{MedLDA}^r_{full}$. The looser coupling will lead to inferior empirical performance, as we
shall see.



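For the standard normal prior, the dual problem D2 is a QP problem:

$$
\begin{aligned}
\max_{\mu,\, \mu^*}\ & -\frac{1}{2} \|\lambda\|_2^2 - \epsilon \sum_{d=1}^{D} (\mu_d + \mu_d^*) + \sum_{d=1}^{D} y_d (\mu_d - \mu_d^*) \\
\text{s.t. } \forall d:\ & \mu_d,\ \mu_d^* \in [0, C].
\end{aligned}
$$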

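Similarly, we can derive its primal form, which can be reformulated as a standard SVR problem:

$$
\begin{aligned}
\min_{\lambda,\, \xi,\, \xi^*}\ & \frac{1}{2} \|\lambda\|_2^2 + C \sum_{d=1}^{D} (\xi_d + \xi_d^*) \\
\text{s.t. } \forall d:\ &
\begin{cases}
y_d - \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d \\
-y_d + \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d^* \\
\xi_d,\ \xi_d^* \ge 0.
\end{cases}
\end{aligned}
$$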

Now, we can leverage recent developments in support vector regression to solve either 
the dual problem or the primal problem. 

2. M-step: the same as in MedLDA$^r_{full}$.
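
The E-step quantities above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the dual variables $\mu_d, \mu_d^*$ are assumed to come from some external SVR solver, $E[\log \theta|\gamma]$ is assumed to be precomputed, and $E[\eta] = \lambda$ follows from the standard normal prior case $q(\eta) = \mathcal{N}(\lambda, I)$.

```python
import math


def expected_zbar(phi_d):
    """E[Z-bar_d]: average of the per-word topic responsibilities (N x K list)."""
    n_words = len(phi_d)
    n_topics = len(phi_d[0])
    return [sum(row[k] for row in phi_d) / n_words for k in range(n_topics)]


def posterior_mean(mu, mu_star, phis):
    """lambda = sum_d (mu_d - mu*_d) E[Z-bar_d], the mean of q(eta) = N(lambda, I)."""
    n_topics = len(phis[0][0])
    lam = [0.0] * n_topics
    for m, m_star, phi_d in zip(mu, mu_star, phis):
        zbar = expected_zbar(phi_d)
        for k in range(n_topics):
            lam[k] += (m - m_star) * zbar[k]
    return lam


def update_phi_word(e_log_theta, log_beta_w, lam, mu_d, mu_star_d, n_words):
    """Eq. (9) for one word: softmax of the two LDA terms plus the margin term."""
    scores = [
        e_log_theta[k] + log_beta_w[k] + lam[k] * (mu_d - mu_star_d) / n_words
        for k in range(len(lam))
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

When $\mu_d = \mu_d^*$ (the document is not a support vector), the last term vanishes and the update reduces to the usual unsupervised LDA update, which is the sense in which the coupling is loose.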

4. Maximum Entropy Discrimination LDA for Classification 

In this section, we consider the discrete response variable and present the MedLDA classi- 
fication model. 

4.1 Learning MedLDA for Classification 

For classification, the response variables $y$ are discrete. For brevity, we only consider
multi-class classification, where $y \in \{1, \dots, M\}$. The binary case can be easily defined based
on a binary SVM and the optimization problem can be solved similarly.

For classification, we assume the discriminant function $F$ is linear, that is, $F(y, z_{1:N}, \eta) = \eta_y^\top \bar{z}$, where $\bar{z} = \frac{1}{N} \sum_n z_n$ as in the regression model, $\eta_y$ is a class-specific $K$-dimensional parameter vector associated with the class $y$, and $\eta$ is an $MK$-dimensional vector obtained by stacking the elements of $\eta_y$. Equivalently, $F$ can be written as $F(y, z_{1:N}, \eta) = \eta^\top f(y, \bar{z})$, where $f(y, \bar{z})$ is a feature vector whose components from $(y-1)K + 1$ to $yK$ are those of the vector $\bar{z}$ and all the others are 0. From each single $F$, a prediction rule can be derived as in SVM. Here, we consider the general case of learning a distribution $q(\eta)$, and for prediction we take the average over all the possible models and the latent topics:

$$y^* = \arg\max_y E[\eta^\top f(y, \bar{Z}) | \alpha, \beta]. \quad (10)$$
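
As a minimal sketch of the stacked feature vector and the averaged prediction rule, the following Python assumes a fully factorized $q$, so that the expectation in Eq. (10) can be evaluated as $E[\eta]^\top f(y, E[\bar{Z}])$. The function names and the toy numbers are illustrative assumptions, not part of the original model description.

```python
def feature_map(y, zbar, n_classes):
    """f(y, zbar): copy zbar into block y (labels are 1-based), zeros elsewhere."""
    k = len(zbar)
    f = [0.0] * (n_classes * k)
    f[(y - 1) * k : y * k] = list(zbar)
    return f


def predict(e_eta, e_zbar, n_classes):
    """Return argmax_y E[eta]^T f(y, E[zbar]); e_eta has length M * K."""
    def score(y):
        return sum(w * v for w, v in zip(e_eta, feature_map(y, e_zbar, n_classes)))

    return max(range(1, n_classes + 1), key=score)
```

For example, with $M = 3$ classes and $K = 2$ topics, `feature_map(2, [0.2, 0.8], 3)` places the topic proportions in the third and fourth components, and `predict` simply compares the three class-specific inner products.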
Now, the problem is to learn an optimal set of parameters $\alpha, \beta$ and distribution $q(\eta)$. Below, we present the MedLDA classification model. In principle, we can develop two variants of MedLDA classification models, which use the supervised sLDA (Wang et al., 2009) and the unsupervised LDA to discover latent topics, as in the regression case. However, for the case of using supervised sLDA for classification, it is impossible to derive a dual formulation of its optimization problem because of the normalized non-Gaussian prediction model (Blei and McAuliffe, 2007; Wang et al., 2009). Here, we consider the case where we use the unsupervised LDA as the underlying topic model to discover the latent topic representation $Z$. As we shall see, the MedLDA classification model can be easily learned by using existing SVM solvers to optimize its dual optimization problem.



4.1.1 Max-Margin Learning of LDA for Classification 

As we have stated, the supervised sLDA model has a normalization factor that makes the learning generally intractable, except for some special cases like the normal distribution in the regression case. In (Blei and McAuliffe, 2007; Wang et al., 2009), variational methods or a high-order Taylor expansion are applied to approximate the normalization factor in the classification model. In our max-margin formulation, since our target is to directly minimize a hinge loss, we do not need a normalized distribution model for the response variables $Y$. Instead, we define a partially generative model on $(\theta, \mathbf{z}, \mathbf{W})$ only, as in the unsupervised LDA, and for the classification (i.e., from $Z$ to $Y$), we apply the max-margin principle, which does not require a normalized distribution. Thus, in this case, the likelihood of the corpus $\mathcal{D}$ is $p(\mathbf{W}|\alpha, \beta)$.

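Similar to the MedLDA$^r_{partial}$ regression model, we define the integrated latent topic discovery and multi-class classification model as follows:

$$
\begin{aligned}
\mathrm{P3}(\mathrm{MedLDA}^c):\quad \min_{q,\, q(\eta),\, \alpha,\, \beta,\, \xi}\ & \mathcal{L}(q) + KL(q(\eta) \,\|\, p_0(\eta)) + C \sum_{d=1}^{D} \xi_d \\
\text{s.t. } \forall d,\ y \neq y_d:\ & E[\eta^\top \Delta f_d(y)] \ge 1 - \xi_d; \quad \xi_d \ge 0,
\end{aligned}
$$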

where $q(\theta, \mathbf{z}|\gamma, \phi)$ is a variational distribution; $\mathcal{L}(q)$ is a variational upper bound of $-\log p(\mathbf{W}|\alpha, \beta)$; $\Delta f_d(y) = f(y_d, \bar{Z}_d) - f(y, \bar{Z}_d)$; and $\xi$ are slack variables. $E[\eta^\top \Delta f_d(y)]$ is the "expected margin" by which the true label $y_d$ is favored over a prediction $y$. These margin constraints make MedLDA$^c$ fundamentally different from the mixture of conditional max-entropy models (Pavlov et al., 2003), where constraints are based on moment matching, i.e., empirical expectations of features are equal to their model expectations.

The rationale underlying MedLDA$^c$ is similar to that of MedLDA$^r$: we want to find a latent topic representation $q(\theta, \mathbf{z}|\gamma, \phi)$ and a parameter distribution $q(\eta)$ which on the one hand tend to predict as accurately as possible on training data, while on the other hand tend to explain the data well. The KL-divergence term in P3 is a regularizer of the distribution $q(\eta)$.



4.2 Variational EM-Algorithm 

As in MedLDA$^r$, we can develop a similar variational EM algorithm. Specifically, we assume that $q$ is fully factorized, as in the standard unsupervised LDA. Then, $E[\eta^\top f(y, \bar{Z}_d)] = E[\eta]^\top f(y, \frac{1}{N} \sum_{n=1}^{N} \phi_{dn})$. We formulate the Lagrangian $L$ of P3:

$$
\begin{aligned}
L =\ & \mathcal{L}(q) + KL(q(\eta) \,\|\, p_0(\eta)) + C \sum_{d=1}^{D} \xi_d - \sum_{d=1}^{D} v_d \xi_d - \sum_{d=1}^{D} \sum_{y \neq y_d} \mu_d(y) \big( E[\eta^\top \Delta f_d(y)] + \xi_d - 1 \big) \\
& - \sum_{d=1}^{D} \sum_{i=1}^{N} c_{di} \Big( \sum_{j=1}^{K} \phi_{dij} - 1 \Big),
\end{aligned}
$$

where the last term is from the normalization condition $\sum_{j=1}^{K} \phi_{dij} = 1,\ \forall i, d$. The EM-algorithm iteratively optimizes $L$ w.r.t. $\gamma$, $\phi$, $q(\eta)$ and $\beta$. Since the constraints in P3 are not on $\gamma$ or $\beta$, their update rules are the same as in MedLDA$^r_{full}$ and we omit the details here. We explain the optimization of $L$ over $\phi$ and $q(\eta)$ and show the insights of the max-margin topic model:

1. Optimize $L$ over $\phi$: again, since $q$ is fully factorized, we can perform the optimization on each document separately. Setting $\partial L / \partial \phi_{di} = 0$, we have:

$$\phi_{di} \propto \exp\Big( E[\log \theta | \gamma] + E[\log p(w_{di}|\beta)] + \frac{1}{N} \sum_{y \neq y_d} \mu_d(y) E[\eta_{y_d} - \eta_y] \Big). \quad (11)$$

The first two terms in Eq. (11) are the same as in the unsupervised LDA, and the last term is due to the max-margin formulation of P3 and reflects our intuition that the discovered latent topic representation is influenced by the max-margin estimation. For those examples that are around the decision boundary, i.e., support vectors, some of the Lagrange multipliers are non-zero, and thus the last term acts as a regularizer that biases the model towards discovering a latent representation that tends to make more accurate predictions on these difficult examples. Moreover, this term is fixed for words in the document and thus will directly affect the latent representation of the document (i.e., $\gamma_d$) and will yield a discriminative latent representation, as we shall see in Section 6, which is more suitable for the classification task.

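2. Optimize $L$ over $q(\eta)$: Similar to the regression model, we have the following optimum solution.

Corollary 4 The optimum solution $q(\eta)$ of MedLDA$^c$ has the form:

$$q(\eta) = \frac{1}{Z} p_0(\eta) \exp\Big( \eta^\top \sum_{d=1}^{D} \sum_{y \neq y_d} \mu_d(y) E[\Delta f_d(y)] \Big). \quad (12)$$

The Lagrange multipliers $\mu$ are the optimum solution of the dual problem:

$$
\begin{aligned}
\mathrm{D3}:\quad \max_{\mu}\ & -\log Z + \sum_{d=1}^{D} \sum_{y \neq y_d} \mu_d(y) \\
\text{s.t. } \forall d:\ & \sum_{y \neq y_d} \mu_d(y) \in [0, C].
\end{aligned}
$$

Again, we can choose different priors in MedLDA$^c$ for different regularization effects. We consider the normal prior in this paper. For the standard normal prior $p_0(\eta) = \mathcal{N}(0, I)$, $q(\eta)$ is a normal with a shifted mean, i.e., $q(\eta) = \mathcal{N}(\lambda, I)$, where $\lambda = \sum_{d=1}^{D} \sum_{y \neq y_d} \mu_d(y) E[\Delta f_d(y)]$, and the dual problem D3 is the same as the dual problem of a standard multi-class SVM that can be solved using existing SVM methods (Crammer and Singer, 2001):

$$
\begin{aligned}
\max_{\mu}\ & -\frac{1}{2} \Big\| \sum_{d=1}^{D} \sum_{y \neq y_d} \mu_d(y) E[\Delta f_d(y)] \Big\|_2^2 + \sum_{d=1}^{D} \sum_{y \neq y_d} \mu_d(y) \\
\text{s.t. } \forall d:\ & \sum_{y \neq y_d} \mu_d(y) \in [0, C].
\end{aligned}
$$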
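
The effect of the max-margin term on the word-level update in Eq. (11) can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, $E[\eta] = \lambda$ is stored as a flat list of $M$ blocks of length $K$ (the standard normal prior case), the multipliers $\mu_d(y)$ are passed as a dictionary keyed by class label, and $E[\log \theta|\gamma]$ is assumed to be precomputed.

```python
import math


def margin_shift(lam, y_true, mu_d, n_topics):
    """sum_{y != y_d} mu_d(y) * (lam_{y_d} - lam_y), one entry per topic."""
    def block(y):  # class labels are 1-based
        return lam[(y - 1) * n_topics : y * n_topics]

    true_block = block(y_true)
    shift = [0.0] * n_topics
    for y, multiplier in mu_d.items():
        if y == y_true:
            continue
        other = block(y)
        for k in range(n_topics):
            shift[k] += multiplier * (true_block[k] - other[k])
    return shift


def update_phi_word(e_log_theta, log_beta_w, shift, n_words):
    """Eq. (11) for one word: the two LDA terms plus shift / N, then normalize."""
    scores = [
        e_log_theta[k] + log_beta_w[k] + shift[k] / n_words
        for k in range(len(shift))
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

Because `shift` does not depend on the word index, it moves every word of a support-vector document towards the same topics, which is how the term ends up shaping the document-level representation.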



5. MedTM: a general framework 

We have presented MedLDA, which integrates the max-margin principle with an underlying LDA model, which can be supervised or unsupervised, for discovering predictive latent topic representations of documents. The same principle can be applied to other generative topic models, such as the correlated topic models (CTMs) (Blei and Lafferty, 2005), as well as undirected random fields, such as the exponential family harmoniums (EFH) (Welling et al., 2004).



Formally, the max-entropy discrimination topic models (MedTM) can be generally defined as:

$$
\begin{aligned}
\mathrm{P}(\mathrm{MedTM}):\quad \min_{q(H),\, q(\Upsilon),\, \Psi,\, \xi}\ & \mathcal{L}(q(H)) + KL(q(\Upsilon) \,\|\, p_0(\Upsilon)) + U(\xi) \\
\text{s.t. }\ & \text{expected margin constraints},
\end{aligned}
$$

where $H$ are hidden variables (e.g., $(\theta, \mathbf{z})$ in LDA); $\Upsilon$ are the parameters of the model pertaining to the prediction task (e.g., $\eta$ in sLDA); $\Psi$ are the parameters of the underlying topic model (e.g., the Dirichlet parameter $\alpha$); and $\mathcal{L}$ is a variational upper bound of the negative log likelihood associated with the underlying topic model. $U$ is a convex function over slack variables. For the general MedTM model, we can develop a variational EM-algorithm similar to that of MedLDA. Note that $\Upsilon$ can be a part of $H$. For example, the underlying topic model of MedLDA$^r$ is a Bayesian sLDA. In this case, $H = (\theta, \mathbf{z}, \eta)$, $\Upsilon = \emptyset$, and the term $KL(q(\eta) \,\|\, p_0(\eta))$ is contained in its $\mathcal{L}$.

Finally, based on the recent extension of maximum entropy discrimination (MED) (Jaakkola et al., 1999) to the structured prediction setting (Zhu et al., 2008b), the basic principle of MedLDA can be similarly extended to perform structured prediction, where multiple response variables are predicted simultaneously and thus their mutual dependencies can be exploited to achieve globally consistent and optimal predictions. Likelihood-based structured prediction latent topic models have been developed in different scenarios, such as image annotation (He and Zemel, 2008) and statistical machine translation (Zhao and Xing, 2006). The extension of MedLDA to the structured prediction setting could provide a promising alternative for such problems.






6. Experiments 

In this section, we provide qualitative as well as quantitative evaluation of MedLDA on text 
modeling, classification and regression. 



6.1 Text Modeling 

We study text modeling of MedLDA on the 20 Newsgroups data set with a standard list of stop words$^4$ removed. The data set contains postings in 20 related categories. We compare with the standard unsupervised LDA. We fit the data set to a 110-topic MedLDA$^c$ model, which exploits the supervised category information, and a 110-topic unsupervised LDA.

Figure 2 shows the 2D embedding of the expected topic proportions of MedLDA$^c$ and LDA obtained with the t-SNE stochastic neighborhood embedding (van der Maaten and Hinton, 2008), where each dot represents a document and color-shape pairs represent class labels. The max-margin based MedLDA$^c$ clearly produces a better grouping and separation of the documents in different categories. In contrast, the unsupervised LDA does not produce a well separated embedding, and documents in different categories tend to mix together. A similar embedding was presented in (Lacoste-Julien et al., 2008), where the transformation matrix in their model is pre-designed. The results of MedLDA$^c$ in Figure 2 are automatically learned.

It is also interesting to examine the discovered topics and their association with class labels. In Figure 3 we show the top topics in four classes as discovered by both MedLDA and LDA. Moreover, we depict the per-class distribution over topics for each model. This distribution is computed by averaging the expected latent representation of the documents in each class. We can see that MedLDA yields sharper, sparser and faster-decaying per-class distributions over topics, which have better discrimination power. This behavior is in fact due to the regularization effect enforced over $\phi$ as shown in Eq. (11). On the other hand, LDA seems to discover topics that model the fine details of documents with no regard to their discrimination power (i.e., it discovers different variations of the same topic, which results in a flat per-class distribution over topics). For instance, in the class comp.graphics, MedLDA mainly models documents in this class using two salient, discriminative topics (T69 and T11), whereas LDA results in a much flatter distribution. Moreover, in the cases where LDA and MedLDA discover comparably the same set of topics in a given class (like politics.mideast and misc.forsale), MedLDA results in a sharper low dimensional representation.



6.2 Prediction Accuracy 

In this subsection, we provide a quantitative evaluation of the MedLDA on prediction 
performance. 



6.2.1 Classification 

We perform binary and multi-class classification on the 20 Newsgroups data set. To obtain a baseline, we first fit all the data to an LDA model, and then use the latent representation of the training documents as features to build a binary/multi-class SVM classifier. We denote this baseline by LDA+SVM. For a model $\mathcal{M}$, we evaluate its performance using the relative improvement ratio, i.e., $\frac{\mathrm{precision}(\mathcal{M}) - \mathrm{precision}(\mathrm{LDA+SVM})}{\mathrm{precision}(\mathrm{LDA+SVM})}$.

4. http://mallet.cs.umass.edu/

Figure 3: Top topics under each class as discovered by the MedLDA and LDA models.


Note that since DiscLDA (Lacoste-Julien et al., 2008) uses Gibbs sampling for inference, which is slightly different from the variational methods used in MedLDA and sLDA (Blei and McAuliffe, 2007; Wang et al., 2009), we build the baseline model of LDA+SVM with both variational inference and Gibbs sampling. The relative improvement ratio of each model is computed against the baseline with the same inference method.
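
The relative improvement ratio is simple enough to state as code. The following is a minimal sketch; the function name is hypothetical and the precision values in the usage note are made-up numbers for illustration, not results from the experiments.

```python
def relative_improvement(precision_model, precision_baseline):
    """(precision(M) - precision(LDA+SVM)) / precision(LDA+SVM)."""
    if precision_baseline <= 0:
        raise ValueError("baseline precision must be positive")
    return (precision_model - precision_baseline) / precision_baseline
```

For instance, a hypothetical model with precision 0.66 against a baseline of 0.60 has a relative improvement ratio of about 0.1. Each model should be paired with the LDA+SVM baseline that uses the same inference method (variational or Gibbs sampling).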

Binary Classification: As in (Lacoste-Julien et al., 2008), the binary classification task is to distinguish postings of the newsgroup alt.atheism from postings of the group talk.religion.misc. We compare MedLDA$^c$ with sLDA, DiscLDA and LDA+SVM. For sLDA, the extension to perform multi-class classification was presented by Wang et al. (2009); we will compare with it in the multi-class classification setting. Here, for the binary case, we fit an sLDA regression model using the binary representation (0/1) of the classes, and use a threshold of 0.5 to make predictions. For MedLDA$^c$, to see whether a second-stage max-margin classifier can improve the performance, we also build a method MedLDA+SVM, similar to LDA+SVM. All the above methods that utilize the class label information are fit only on the training data.

5. We use the training/testing split in: http://people.csail.mit.edu/jrennie/20Newsgroups/

Figure 4: Relative improvement ratio against LDA+SVM for: (a) binary and (b) multi-class classification.



We use SVM-light (Joachims, 1999) to build SVM classifiers and to estimate $q(\eta)$ in MedLDA$^c$. The parameter $C$ is chosen via 5-fold cross-validation during training from $\{k^2 : k = 1, \dots, 8\}$. For each model, we run the experiments 5 times and take the average as the final result. The relative improvement ratios of different models with respect to topic numbers are shown in Figure 4(a). For DiscLDA (Lacoste-Julien et al., 2008), the number of topics is set by the equation $2K_0 + K_1$, where $K_0$ is the number of topics per class and $K_1$ is the number of topics shared by all categories. As in (Lacoste-Julien et al., 2008), $K_1 = 2K_0$. Here, we set $K_0 = 1, \dots, 8, 10$ and align the results with those of MedLDA and sLDA that have the closest topic numbers.
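
The bookkeeping in this setup can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the candidate grid for $C$ assumes the set reads $\{k^2 : k = 1, \dots, 8\}$, the MedLDA topic numbers in the usage note are made up, and breaking ties towards the smaller topic number is an assumption that the text does not specify.

```python
def disclda_topic_count(k0):
    """Total DiscLDA topics in the binary task: 2 * K0 + K1 with K1 = 2 * K0."""
    k1 = 2 * k0
    return 2 * k0 + k1


def closest_topic_number(target, candidates):
    """Candidate topic number nearest to target (ties go to the smaller one)."""
    return min(candidates, key=lambda t: (abs(t - target), t))


# Candidate regularization constants for the 5-fold cross-validation.
C_GRID = [k ** 2 for k in range(1, 9)]
```

With $K_0 = 1, \dots, 8, 10$ the DiscLDA topic counts are 4, 8, ..., 32, 40, and a DiscLDA run with 12 topics would, for example, be aligned with a 10-topic MedLDA run if MedLDA had been fit with 5, 10, 15 and 20 topics.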

We can see that the max-margin based MedLDA$^c$ works better than sLDA, DiscLDA and the two-step method of LDA+SVM. Since MedLDA$^c$ integrates the max-margin principle in its training, the combination of MedLDA and SVM does not yield additional benefits on this task. We believe that the slight differences between MedLDA and MedLDA+SVM are due to tuning of the regularization parameters. For efficiency, we do not change the regularization constant $C$ while training MedLDA$^c$. The performance would be improved if we selected a good $C$ in different iterations, because the data representation is changing.

Multi-class Classification: We perform multi-class classification on 20 Newsgroups with all the categories. We compare MedLDA$^c$ with MedLDA+SVM, LDA+SVM, multi-class sLDA (multi-sLDA) (Wang et al., 2009), and DiscLDA. We use the SVM$^{struct}$ package$^6$ with a 0/1 loss to solve the sub-step of learning $q(\eta)$ and to build the SVM classifiers for LDA+SVM and MedLDA+SVM. The results are shown in Figure 4(b). For DiscLDA, we use the same equation as in (Lacoste-Julien et al., 2008) to set the number of topics and set $K_0 = 1, \dots, 5$. Again, we need to align the results with those of MedLDA based on the closest topic number criterion. We can see that all the supervised topic models discover more predictive topics for classification, and the max-margin based MedLDA$^c$ can achieve significant improvements with an appropriate number (e.g., $\ge 80$) of topics. Again, we believe that the slight difference between MedLDA$^c$ and MedLDA+SVM is due to parameter tuning.

Figure 5: Predictive $R^2$ (left) and per-word likelihood (right) of different models on the movie review data set.



6.2.2 Regression 



We evaluate the MedLDA$^r$ model on the movie review data set. As in (Blei and McAuliffe, 2007), we take logs of the response values to make them approximately normal. We compare MedLDA$^r$ with the unsupervised LDA and sLDA. As we have stated, the underlying topic model in MedLDA$^r$ can be an LDA or an sLDA. We have implemented both, denoted by MedLDA (partial) and MedLDA (full), respectively. For LDA, we use its low dimensional representation of documents as input features to a linear SVR and denote this method by LDA+SVR. The evaluation criterion is predictive $R^2$ (pR$^2$) as defined in (Blei and McAuliffe, 2007).
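
For reference, predictive $R^2$ can be computed as one minus the ratio of the residual sum of squares to the total sum of squares on held-out responses, which is our reading of the definition in (Blei and McAuliffe, 2007). The sketch below uses a hypothetical function name and toy numbers.

```python
def predictive_r2(y_true, y_pred):
    """pR^2 = 1 - sum (y - y_hat)^2 / sum (y - mean(y))^2 on held-out data."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("responses are constant; pR^2 is undefined")
    return 1.0 - ss_res / ss_tot
```

A predictor that always outputs the mean response scores 0, perfect predictions score 1, and a predictor worse than the mean gives a negative value.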

Figure 5 shows the results together with the per-word likelihood. We can see that the supervised MedLDA and sLDA get much better results than the unsupervised LDA, which ignores supervised responses. By using max-margin learning, MedLDA (full) gets slightly better results than the likelihood-based sLDA, especially when the number of topics is small (e.g., $\le 15$). Indeed, when the number of topics is small, the latent representation of sLDA alone does not result in a highly separable problem, so the integration of max-margin training helps in discovering a more discriminative latent representation using the same number of topics. In fact, the number of support vectors (i.e., documents that have at least one non-zero Lagrange multiplier) decreases dramatically at $T = 15$ and stays nearly the same for $T > 15$, which with reference to Eq. (5) explains why the relative improvement over sLDA decreases as $T$ increases. This behavior suggests that MedLDA can discover more predictive latent structures for difficult, non-separable problems.

For the two variants of MedLDA$^r$, we can see an obvious improvement of MedLDA (full). This is because for MedLDA (partial), the update rule of $\phi$ does not have the third and fourth terms of Eq. (5). Those terms couple the max-margin estimation and latent topic discovery more tightly. Finally, a linear SVR on the empirical word frequency gets a pR$^2$ of 0.458, worse than those of sLDA and MedLDA.

6. http://svmlight.joachims.org/svm_multiclass.html

Figure 6: Training time of different models with respect to the number of topics for binary classification.

6.2.3 Time Efficiency 

For binary classification, MedLDA$^c$ is much more efficient than sLDA, and is comparable with LDA+SVM, as shown in Figure 6. The slowness of sLDA may be due to the mismatch between its normal assumption and the non-Gaussian binary response variables, which prolongs the E-step. For multi-class classification, the training time of MedLDA$^c$ is mainly dependent on solving a multi-class SVM problem, and thus is comparable to that of LDA. For regression, the training time of MedLDA (full) is comparable to that of sLDA, while MedLDA (partial) is more efficient.



7. Related Work 



Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is a hierarchical Bayesian model for discovering latent topics in a document collection. LDA has found wide applications in information retrieval, data mining, computer vision, etc. LDA is an unsupervised model.



Supervised LDA (Blei and McAuliffe, 2007) was proposed for regression problems. Although sLDA was generalized to classification with a generalized linear model (GLM), no results have been reported on the classification performance of sLDA. One important issue that hinders sLDA from being effectively applied to classification is that it has a normalization factor, because sLDA defines a fully generative model. The normalization factor makes the learning very difficult: a variational method or higher-order statistics must be applied to deal with the normalizer, as shown in (Blei and McAuliffe, 2007). Instead, MedLDA applies the concept of margin and directly concentrates on maximizing the margin. Thus, MedLDA does not need to define a fully generative model, and the problem of MedLDA for classification can be easily handled by solving a dual QP problem, in the same spirit as SVM.

DiscLDA (Lacoste-Julien et al., 2008) is another supervised LDA model, which was specifically proposed for classification problems. DiscLDA also defines a fully generative model, but instead of maximizing the evidence, it maximizes the conditional likelihood, in the same spirit as conditional random fields (Lafferty et al., 2001). Our MedLDA significantly differs from DiscLDA. The implementation of MedLDA is extremely simple.

Other variants of topic models that leverage supervised information have been developed in different application scenarios, including the models for online reviews (Titov and McDonald, 2008; Branavan et al., 2008), image annotation (He and Zemel, 2008) and the credit attribution Labeled LDA model (Ramage et al., 2009).

The maximum entropy discrimination (MED) (Jaakkola et al., 1999) principle provides an excellent combination of max-margin learning and Bayesian-style estimation. Recent work (Zhu et al., 2008b) extends the MED framework to the structured learning setting and generalizes it to incorporate structured hidden variables in a Markov network. MedLDA is an application of the MED principle to learn a latent Dirichlet allocation model. Unlike (Westerdijk and Wiegerinck, 2000), where a generative model is degenerated to a deterministic version for classification, our model is generative and thus can discover the latent topics over document collections.

The basic principle of MedLDA can be generalized to the structured prediction setting, in which multivariate response variables are predicted simultaneously and thus their mutual dependencies can be exploited to achieve globally consistent and optimal predictions. At least two scenarios are within our horizon that can be directly solved via MedLDA, i.e., image annotation (He and Zemel, 2008), where neighboring annotations tend to be smooth, and statistical machine translation (Zhao and Xing, 2006), where tokens are naturally aligned in word sentences.



8. Conclusions and Discussions 

We have presented the maximum entropy discrimination LDA (MedLDA) that uses the max-margin principle to train supervised topic models. MedLDA integrates the max-margin principle into the latent topic discovery process via optimizing one single objective function with a set of expected margin constraints. This integration yields a predictive topic representation that is more suitable for regression or classification. We develop efficient variational methods for MedLDA. The empirical results on movie review and 20 Newsgroups data sets show the promise of MedLDA on text modeling and prediction accuracy.

MedLDA represents the first step towards integrating the max-margin principle into supervised topic models, and under the general MedTM framework presented in Section 5, several improvements and extensions are on the horizon. Specifically, due to the nature of MedTM's joint optimization formulation, advances in either max-margin training or better variational bounds for inference can be easily incorporated. For instance, the mean field variational upper bound in MedLDA can be improved by using the tighter collapsed variational bound (Teh et al., 2006) that achieves results comparable to collapsed Gibbs sampling (Griffiths and Steyvers, 2004). Moreover, as the experimental results suggest, incorporation of a more expressive underlying topic model enhances the overall performance. Therefore, we plan to integrate and utilize other underlying topic models like the fully generative sLDA model in the classification case. Finally, advances in max-margin training would also result in more efficient training.
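Proof of Corollary 3

In this section, we prove Corollary 3.

Proof Since the variational parameters $(\gamma, \phi)$ are fixed when solving for $q(\eta)$, we can ignore the terms in $\mathcal{L}^{bs}$ that do not depend on $q(\eta)$ and get the function

$$
\begin{aligned}
\mathcal{L}^{bs}_{[q(\eta)]} &\triangleq KL(q(\eta) \,\|\, p_0(\eta)) - \sum_{d} E_q[\log p(y_d | \bar{Z}_d, \eta, \delta^2)] \\
&= KL(q(\eta) \,\|\, p_0(\eta)) + \frac{1}{2\delta^2} E_q\Big[ \eta^\top E[A^\top A] \eta - 2 \eta^\top \sum_{d=1}^{D} y_d E[\bar{Z}_d] \Big] + c,
\end{aligned}
$$

where $c$ is a constant that does not depend on $q(\eta)$.

Let $U(\xi, \xi^*) = C \sum_{d=1}^{D} (\xi_d + \xi_d^*)$. Suppose $(q_0(\eta), \xi_0, \xi_0^*)$ is the optimal solution of P1; then we have, for any feasible $(q(\eta), \xi, \xi^*)$,

$$\mathcal{L}^{bs}_{[q_0(\eta)]} + U(\xi_0, \xi_0^*) \le \mathcal{L}^{bs}_{[q(\eta)]} + U(\xi, \xi^*).$$

From Corollary 2, we conclude that the optimum predictive parameter distribution is $q_0(\eta) = \mathcal{N}(\lambda_0, \Sigma)$, where $\Sigma = (I + \frac{1}{\delta^2} E[A^\top A])^{-1}$ does not depend on $q(\eta)$. Since $q_0(\eta)$ is also normal, for any distribution$^7$ $q(\eta) = \mathcal{N}(\lambda, \Sigma)$, with several steps of algebra it is easy to show that

$$\mathcal{L}^{bs}_{[q(\eta)]} = \frac{1}{2} \lambda^\top \Sigma^{-1} \lambda - \lambda^\top \Big( \sum_{d=1}^{D} \frac{y_d}{\delta^2} E[\bar{Z}_d] \Big) + c',$$

where $c'$ is another constant that does not depend on $\lambda$.

Thus, we can get: for any $(\lambda, \xi, \xi^*)$, where

$$(\lambda, \xi, \xi^*) \in \{ (\lambda, \xi, \xi^*) : y_d - \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d;\ -y_d + \lambda^\top E[\bar{Z}_d] \le \epsilon + \xi_d^*;\ \text{and } \xi_d, \xi_d^* \ge 0\ \forall d \},$$

we have

$$\frac{1}{2} \lambda_0^\top \Sigma^{-1} \lambda_0 - \lambda_0^\top \Big( \sum_{d=1}^{D} \frac{y_d}{\delta^2} E[\bar{Z}_d] \Big) + U(\xi_0, \xi_0^*) \le \frac{1}{2} \lambda^\top \Sigma^{-1} \lambda - \lambda^\top \Big( \sum_{d=1}^{D} \frac{y_d}{\delta^2} E[\bar{Z}_d] \Big) + U(\xi, \xi^*),$$

which means the mean of the optimum posterior distribution under a Gaussian MedLDA is achieved by solving a primal problem as stated in the Corollary.

7. Although the feasible set of $q(\eta)$ in P1 is much richer than the set of normal distributions with the covariance matrix $\Sigma$, Corollary 2 shows that the solution is a restricted normal distribution. Thus, it suffices to consider only these normal distributions in order to learn the mean of the optimum distribution.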